Goto

Collaborating Authors

 red team


For Those Who May Find Themselves on the Red Team

Shoemaker, Tyler

arXiv.org Artificial Intelligence

This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not out-right complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.


The Workflow as Medium: A Framework for Navigating Human-AI Co-Creation

Ackerman, Lee

arXiv.org Artificial Intelligence

This paper introduces the Creative Intelligence Loop (CIL), a novel socio-technical framework for responsible human-AI co-creation. Rooted in the 'Workflow as Medium' paradigm, the CIL proposes a disciplined structure for dynamic human-AI collaboration, guiding the strategic integration of diverse AI teammates who function as collaborators while the human remains the final arbiter for ethical alignment and creative integrity. The CIL was empirically demonstrated through the practice-led creation of two graphic novellas, investigating how AI could serve as an effective creative colleague within a subjective medium lacking objective metrics. The process required navigating multifaceted challenges including AI's 'jagged frontier' of capabilities, sycophancy, and attention-scarce feedback environments. This prompted iterative refinement of teaming practices, yielding emergent strategies: a multi-faceted critique system integrating adversarial AI roles to counter sycophancy, and prioritizing 'feedback-ready' concrete artifacts to elicit essential human critique. The resulting graphic novellas analyze distinct socio-technical governance failures: 'The Steward' examines benevolent AI paternalism in smart cities, illustrating how algorithmic hubris can erode freedom; 'Fork the Vote' probes democratic legitimacy by comparing centralized AI opacity with emergent collusion in federated networks. This work contributes a self-improving framework for responsible human-AI co-creation and two graphic novellas designed to foster AI literacy and dialogue through accessible narrative analysis of AI's societal implications.


Evaluating Control Protocols for Untrusted AI Agents

Kutasov, Jon, Loughridge, Chloe, Sun, Yuqi, Sleight, Henry, Shlegeris, Buck, Tracy, Tyler, Benton, Joe

arXiv.org Artificial Intelligence

As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.


Learning Large-Scale Competitive Team Behaviors with Mean-Field Interactions and Online Opponent Modeling

Jeloka, Bhavini, Guan, Yue, Tsiotras, Panagiotis

arXiv.org Artificial Intelligence

While multi-agent reinforcement learning (MARL) has been proven effective across both collaborative and competitive tasks, existing algorithms often struggle to scale to large populations of agents. Recent advancements in mean-field (MF) theory provide scalable solutions by approximating population interactions as a continuum, yet most existing frameworks focus exclusively on either fully cooperative or purely competitive settings. To bridge this gap, we introduce MF-MAPPO, a mean-field extension of PPO designed for zero-sum team games that integrate intra-team cooperation with inter-team competition. MF-MAPPO employs a shared actor and a minimally informed critic per team and is trained directly on finite-population simulators, thereby enabling deployment to realistic scenarios with thousands of agents. We further show that MF-MAPPO naturally extends to partially observable settings through a simple gradient-regularized training scheme. Our evaluation utilizes large-scale benchmark scenarios using our own testing simulation platform for MF team games (MFEnv), including offense-defense battlefield tasks as well as variants of population-based rock-paper-scissors games that admit analytical solutions, for benchmarking. Across these benchmarks, MF-MAPPO outperforms existing methods and exhibits complex, heterogeneous behaviors, demonstrating the effectiveness of combining mean-field theory and MARL techniques at scale.


From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming

Sinha, Anusha, Grimes, Keltin, Lucassen, James, Feffer, Michael, VanHoudnos, Nathan, Wu, Zhiwei Steven, Heidari, Hoda

arXiv.org Artificial Intelligence

A red team simulates adversary attacks to help defenders find effective strategies to defend their systems in a real-world operational setting. As more enterprise systems adopt AI, red-teaming will need to evolve to address the unique vulnerabilities and risks posed by AI systems. We take the position that AI systems can be more effectively red-teamed if AI red-teaming is recognized as a domain-specific evolution of cyber red-teaming. Specifically, we argue that existing Cyber Red Teams who adopt this framing will be able to better evaluate systems with AI components by recognizing that AI poses new risks, has new failure modes to exploit, and often contains unpatchable bugs that re-prioritize disclosure and mitigation strategies. Similarly, adopting a cybersecurity framing will allow existing AI Red Teams to leverage a well-tested structure to emulate realistic adversaries, promote mutual accountability with formal rules of engagement, and provide a pattern to mature the tooling necessary for repeatable, scalable engagements. In these ways, the merging of AI and Cyber Red Teams will create a robust security ecosystem and best position the community to adapt to the rapidly changing threat landscape.


PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments

Schipper, Olivier, Zhang, Yudi, Du, Yali, Pechenizkiy, Mykola, Fang, Meng

arXiv.org Artificial Intelligence

Abstract--LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. T o address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose T actiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that T actiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. T o encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments. Witnessing rapid advancements, Large Language Models (LLMs) have emerged as powerful tools for complex reasoning, decision-making, and facilitating multi-agent collaboration [27, 20, 22]. This has driven increasing interest in developing cooperative multi-agent systems [3, 7], leading to the creation of benchmarks based on diverse cooperative games such as Minecraft [4] and Overcooked [1]. Minecraft, in particular, has become an important platform due to its open-ended environment and rich state and action spaces [19, 25]. However, current Minecraft-based benchmarks mainly address cooperative tasks characterized by stationary dynamics and fixed objectives, making them insufficient for evaluating adaptability and strategic decision-making in competitive, dynamic environments. Traditional reinforcement learning benchmarks like StarCraft Multi-Agent Challenge (SMAC) [16] and Lux AI Challenge [18] introduce instability and nonstationarity through competitive adversaries but lack the rich, open-ended interactions found in Minecraft. Bridging this gap by integrating both cooperative and competitive elements within a single dynamic environment is essential to rigorously assess the adaptability and generalizability of advanced multi-agent systems.


Who's the Evil Twin? Differential Auditing for Undesired Behavior

Balappanawar, Ishwar, Vattikuti, Venkata Hasith, Kintzley, Greta, Azimi-Mancel, Ronan, Golechha, Satvik

arXiv.org Artificial Intelligence

Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100\% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.


Human-Robot Red Teaming for Safety-Aware Reasoning

Sheetz, Emily, Zemler, Emma, Savchenko, Misha, Rainen, Connor, Holum, Erik, Graf, Jodi, Albright, Andrew, Azimi, Shaun, Kuipers, Benjamin

arXiv.org Artificial Intelligence

-- While much research explores improving robot capabilities, there is a deficit in researching how robots are expected to perform tasks safely, especially in high-risk problem domains. Robots must earn the trust of human operators in order to be effective collaborators in safety-critical tasks, specifically those where robots operate in human environments. We propose the human-robot red teaming paradigm for safety-aware reasoning . We expect humans and robots to work together to challenge assumptions about an environment and explore the space of hazards that may arise. This exploration will enable robots to perform safety-aware reasoning, specifically hazard identification, risk assessment, risk mitigation, and safety reporting. We demonstrate that: (a) human-robot red teaming allows human-robot teams to plan to perform tasks safely in a variety of domains, and (b) robots with different embodiments can learn to operate safely in two different environments--a lunar habitat and a household--with varying definitions of safety. T aken together, our work on human-robot red teaming for safety-aware reasoning demonstrates the feasibility of this approach for safely operating and promoting trust on human-robot teams in safety-critical problem domains. I. INTRODUCTION Enabling robots to reason over risks is a crucial capability of performing collaborative assistive tasks in safety-critical domains.


Control Tax: The Price of Keeping AI in Check

Terekhov, Mikhail, Liu, Zhen Ning David, Gulcehre, Caglar, Albanie, Samuel

arXiv.org Artificial Intelligence

The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem better, we introduce the notion of Control tax -- the operational and financial cost of integrating control measures into AI pipelines. Our work makes three key contributions to the field of AIC: (1) we introduce a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances; (2) we conduct comprehensive evaluations of state-of-the-art language models in adversarial settings, where attacker models insert subtle backdoors into code while monitoring models attempt to detect these vulnerabilities; and (3) we provide empirical financial cost estimates for control protocols and develop optimized monitoring strategies that balance safety and cost-effectiveness while accounting for practical constraints like auditing budgets. Our framework enables practitioners to make informed decisions by systematically connecting safety guarantees with their costs, advancing AIC through principled economic feasibility assessment across different deployment contexts.


An Example Safety Case for Safeguards Against Misuse

Clymer, Joshua, Weinbaum, Jonah, Kirk, Robert, Mai, Kimberly, Zhang, Selena, Davies, Xander

arXiv.org Artificial Intelligence

Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a "safety case") that misuse safeguards reduce the risk posed by an AI assistant to low levels. We first describe how a hypothetical developer red teams safeguards, estimating the effort required to evade them. Then, the developer plugs this estimate into a quantitative "uplift model" to determine how much barriers introduced by safeguards dissuade misuse (https://www.aimisusemodel.com/). This procedure provides a continuous signal of risk during deployment that helps the developer rapidly respond to emerging threats. Finally, we describe how to tie these components together into a simple safety case. Our work provides one concrete path -- though not the only path -- to rigorously justifying AI misuse risks are low.